-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing extension sequencing in azure linux agent #1298
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My general feedback is that an already complex section of the code has become harder to understand. I would prefer that when code is touched, it gets simpler. Let me know whether you think this is feasible.
Also some comments and questions inline.
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -67,6 +67,7 @@ | |||
|
|||
TELEMETRY_MESSAGE_MAX_LEN = 3200 | |||
|
|||
DEFAULT_EXT_TIMEOUT_MINUTES = 90 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is a per-extension timeout, won't CRP timeout well before this for multiple extensions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the timeout value per dependency level.
CRP has an independent time out value. I will start an email thread including the CRP-side owner, on when/how to update the CRP timeout value.
azurelinuxagent/ga/exthandlers.py
Outdated
if dep_level != ext_handler.sort_key(): | ||
dep_level = ext_handler.sort_key() | ||
deps_wait_until = handlers_wait_until | ||
handlers_wait_until += datetime.timedelta(seconds=(DEFAULT_EXT_TIMEOUT_MINUTES * 60)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - doesn't timedelta have a min property?
azurelinuxagent/ga/exthandlers.py
Outdated
dependencies = sum([e.dependencies for e in ext_handler.properties.extensions], []) | ||
for dependency in dependencies: | ||
handler_i = ExtHandlerInstance(dependency.handler, self.protocol) | ||
if dependency.timeout is not None and dependency.timeout < DEFAULT_EXT_TIMEOUT_MINUTES: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not keep converting between seconds and minutes, it's an easy thing to miss.
azurelinuxagent/ga/exthandlers.py
Outdated
wait_until -= datetime.timedelta(seconds=( (DEFAULT_EXT_TIMEOUT_MINUTES - dependency.timeout) * 60)) | ||
for ext in dependency.exts: | ||
success_status, status = handler_i.is_ext_status_success(ext) | ||
while not success_status: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a strange construct - a while not
around an if
- please clean up.
azurelinuxagent/ga/exthandlers.py
Outdated
def is_ext_status_success(self, ext): | ||
status = self.collect_ext_status(ext) | ||
if status is None: | ||
return (True, status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems odd, why is this True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some extensions do not report any status. We should not wait for such extensions to report status. That's why we return True here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But by returning True here we are saying that a extension that has not reported status yet has succeeded... correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I think we can return False in this case. @uiri do you have any comment on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collect_ext_status
returns None if the extension does not report any status. Not all extensions report status.
If you return False
in this case, the agent will hang until it times out on such extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Base on offline discussion with @uiri , an extension can install and enable but may not create any status file. In this case, we don't want to wait for it until it is timed out. So we return True in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBD: What should we return if there is no status file?
azurelinuxagent/ga/exthandlers.py
Outdated
for substatus in status.substatusList: | ||
if substatus.status != "success": | ||
return (False, status) | ||
return (True, status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is your default, you are not really testing for success, but failure.
azurelinuxagent/ga/exthandlers.py
Outdated
dependency.handler.name, ext.name, | ||
ext_handler.name, status)) | ||
else: | ||
time.sleep(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a long wait.
azurelinuxagent/ga/exthandlers.py
Outdated
dep_level = ext_handler.sort_key() | ||
deps_wait_until = handlers_wait_until | ||
handlers_wait_until += datetime.timedelta(seconds=(DEFAULT_EXT_TIMEOUT_MINUTES * 60)) | ||
self.wait_on_ext_handler_dependencies(ext_handler, deps_wait_until) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not sure I agree with the way this is done, which is the feedback I gave @uiri previously. If the dependency chain is simple, this should be just an ordering problem. The way this is built allows for a more complex scenario (A depends on B and C), which was not my expectation. Please clarify.
@sirajudeens please address the conflicts as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on 289f2fe
azurelinuxagent/ga/exthandlers.py
Outdated
else: | ||
time.sleep(10) | ||
success_status, status = handler_i.is_ext_status_success(ext) | ||
success_status, status = handler_i.is_ext_status_success(dependency.ext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing None
check for dependency.ext
.
azurelinuxagent/ga/exthandlers.py
Outdated
time.sleep(5) | ||
success_status, status = handler_i.is_ext_status_success(dependency.ext) | ||
|
||
# In case of timeout, we log it and continue with the rest of the extensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the correct behaviour.
The extension should not be run if its dependency timed out. The extension runs once this function returns, unless this function throws an exception.
azurelinuxagent/ga/exthandlers.py
Outdated
dependencies = sum([e.dependencies for e in ext_handler.properties.extensions], []) | ||
for dependency in dependencies: | ||
handler_i = ExtHandlerInstance(dependency.handler, self.protocol) | ||
if dependency.timeout is not None and dependency.timeout < DEFAULT_EXT_TIMEOUT_MINUTES: | ||
wait_until -= datetime.timedelta(seconds=( (DEFAULT_EXT_TIMEOUT_MINUTES - dependency.timeout) * 60)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has this logic been removed entirely?
Further divergence from the Windows agent is not desirable.
c3bdb97
to
7663c56
Compare
Move dependencyLevel parsing from plugin inside plugins to dependsOn Use tuples of handler name and extension name to represent dependencies
An extension's dependencies must report a success status before the agent should install and enable an extension.
The platform gives up to 90 minutes of processing for each dependency level. The first level should have no dependencies, so any waiting should return a status right away. The dependencies of each subsequent level are allocated an additional 90 minutes of processing time to produce a valid status. If the agent has been processing extensions for more than (90 minutes) * (number of levels processed so far) then it should timeout.
d261306
to
5a9a81e
Compare
azurelinuxagent/ga/exthandlers.py
Outdated
self.ext_handlers.extHandlers.sort(key=operator.methodcaller('sort_key')) | ||
for ext_handler in self.ext_handlers.extHandlers: | ||
# TODO: handle install in sequence, enable in parallel | ||
# Use updated time limit for each dependency level | ||
if dep_level != ext_handler.sort_key(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not simply move through the extHandlers sorted by dependency level, and remove all the additional wait code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We not only make sure the extensions are installed sequentially, we need to make sure that the previous extension in the sequence is installed successfully before proceeding with the next extension in the sequence. Hence this wait code with different timeout value for each level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider something like this:
for ext_handler in extHandlers:
handle_ext_handler(ext_handler)
wait_if_necessary(ext_handler)
Seems a little bit simpler than tracking full dependencies per se. We really just care if the last handler is done, if there is a stated dependency.
Regarding the timeouts, I thought we had decided on a total 90?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @hglkrijger that the for loop can be simplified to not make reference to the dependencies, but just ext_handler. Are the dependencies used only to increase the timeout 90 min per dependency level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even now, the for loop does not make any reference to the dependency. It just references what was the previous extension installed in order to wait on that.
I think the one @hglkrijger suggests is that 'install an extension, wait for it'.This has a side effect of waiting for the highest level extensions as well, which is not required to be waited for.
Another way based on his suggestion is 'install an extension, wait for it if it is not the highest dependency level'. This requires calculating the highest dependency level in advance.
Do you think we need to change the logic to match this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per offline discussion we will got for a flat 90 min (instead of 90 per dependency level)
5a9a81e
to
c5d5bf1
Compare
azurelinuxagent/ga/exthandlers.py
Outdated
if not success_status: | ||
msg = "Timeout waiting for success status " \ | ||
"from dependency {0}/{1} for {2}" \ | ||
"status was: {3}".format(prev_handler.name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status is an instance of ExtensionStatus, so format will produce something like "<...ExtensionStatus object at 0x000001BF0B6B7DA0>"
azurelinuxagent/ga/exthandlers.py
Outdated
if status.status != "success": | ||
return (False, status) | ||
for substatus in status.substatusList: | ||
if substatus.status != "success": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hglkrijger - what is the interpretation of status/substatus/status?... are extensions allowed to report "success" in status and, say, "error" in one of the status/substatus/status? is this considered an extension error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per offline discussion @sirajudeens will double-check the validation of substatus in CRP to be sure this matches.
try: | ||
depends_on_level = int(getattrib(depends_on_node, "dependencyLevel")) | ||
except (ValueError, TypeError): | ||
depends_on_level = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to catch these errors? this bumps the priority to 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We expect string value corresponding to an integer value, which is parsed to get the integer value.
In order to guard against any corrupted value, we catch this exception. The dependency level is set to default in in that case. We assume that that kind of extension does not depend on anything so it gets its priority bumped up. Do you have any better suggestion in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I double-checked the previous code and it was doing the same (setting the priority to 0 on error). I think we should log a warning in case bumping it up to 0 affects any dependencies and causes an extension failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this warning is going to create a lot of noise in the logs if the platform is only sending down dependencyLevel for the cases where dependencies exist.
@sirajudeens are we sending down dependencyLevel all the time regardless or only when the customer specifies dependencies? What about for single VMs?
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -871,6 +917,34 @@ def collect_ext_status(self, ext): | |||
|
|||
return ext_status | |||
|
|||
def is_ext_handling_complete(self, ext): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more useful for debugging and logging to define a __str__
method on ExtensionStatus and return the full status.
8056ed2
to
0d9a805
Compare
0d9a805
to
dc0ec36
Compare
azurelinuxagent/ga/exthandlers.py
Outdated
|
||
if dep_level != ext_handler.sort_key(): | ||
return True | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would cleanup this loop a little, maybe something like
first = self.ext_handlers.extHandlers[0].sort_key() if len(extHandlers) > 0 else None
for i in range(1, len(self.ext_handlers.extHandlers)):
if self.ext_handlers.extHandlers[i].sort_key() != first:
return True
return False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole method seems unnecessary. The agent should not wait for the leaf nodes. In the case where there are no dependencies, every node is a leaf node.
azurelinuxagent/ga/exthandlers.py
Outdated
self.ext_handlers.extHandlers.sort(key=operator.methodcaller('sort_key')) | ||
for ext_handler in self.ext_handlers.extHandlers: | ||
# TODO: handle install in sequence, enable in parallel | ||
if ext_seq_required: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest switching the order of the calls to handle_ext_handler and wait_for_prev_handler_completion (call handle first and wait second) and just add a check to avoid waiting for the last extension. I think this would express the intention of the code a little more clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'd suggest something similar to
for i in range(len(self.ext_handlers.extHandlers)):
self.handle_ext_handler(self.ext_handlers.extHandlers[i], etag)
if ext_seq_required:
dep_level = ext_handler.sort_key()
if dep_level >= 0:
if i < len(self.ext_handlers.extHandlers) - 1 and !self.wait_for_handler_successful_completion(self.ext_handlers.extHandlers[i], wait_until):
break
plus updating the comments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use for i in range(L)
in a Python 2 or Python 2/3 codebase such as this one. It creates a list of integers unnecessarily.
For this particular case, keeping track of i
is unnecessary. The second to last ext_handler
may also be a leaf node so the test needs to be against the maximum dependency level, not against its position in the list. if dep_level >= 0 and dep_level < max_dep_level:
ought to be a sufficient test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
azurelinuxagent/ga/exthandlers.py
Outdated
# If handled successfully, proceed with the current handler. | ||
# Otherwise, skip the rest of them. | ||
dep_level = ext_handler.sort_key() | ||
if dep_level >= 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a comment to explain the >=0 (we do this only for install/enable and not for uninstall/disable)
azurelinuxagent/ga/exthandlers.py
Outdated
|
||
return False | ||
|
||
def wait_for_prev_handler_completion(self, prev_handler, wait_until): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove 'prev_' from the name of this function and its parameters. The function just waits for a handler, by itself it doesn't have a concept of "previous handler"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest renaming the function to wait_for_handler_successful_completion or similar
azurelinuxagent/ga/exthandlers.py
Outdated
# No need to wait on anything for the very first extension | ||
if prev_handler == None: | ||
return True | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if implementing the changes suggested for the handle/wait loop on line 320, then there is no need to keep track of the previous handler and this check for null is not needed
azurelinuxagent/ga/exthandlers.py
Outdated
# In case of timeout or terminal error state, we log it and return false | ||
# so that the extensions waiting on this one can be skipped processing | ||
if not ext_completed or status != EXTENSION_STATUS_SUCCESS: | ||
msg = "Extension {0} was timedout or status was: {1}".format(ext.name, status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest rephrasing this message to something similar to "Extension {0} did not reach a terminal state within the allowed timeout. Last status was {1}"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disagree. If the extension completes within the allotted time but returns an error, you still wind up emitting this message. Your suggestion makes that error sound like a timeout. The right thing to do is to cleanly separate the "returned an error" condition from the "did not complete within the allotted time limit" condition, emitting a clear (and distinct) message for each of those two cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasonzio , is your concern about the error message or the functionality itself when there is a failure?
As per the last discussion with the VM agent folks, if an extension timedout/failed, we should stop processing the rest of the extensions, when the extension sequencing is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasonzio agree that 2 separate messages would be better
azurelinuxagent/ga/exthandlers.py
Outdated
if self.wait_for_prev_handler_completion(prev_handler, wait_until): | ||
prev_handler = ext_handler | ||
else: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a message saying we stopped processing other extensions
azurelinuxagent/ga/exthandlers.py
Outdated
# Missing status file is considered a non-terminal state here | ||
# so that extension sequencing can wait until it becomes existing | ||
if not os.path.exists(ext_status_file): | ||
ext_status = ExtensionStatus(seq_no=seq_no) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there no need to return an ExtensionStatus object. I think only the 'status' field is used by the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think only the 'status' field is used by the code
It will be easier to debug if we have the entire status object and not just the fields used by the code. Also, it is impossible to tell whether it is a status or a substatus otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No more substatus per @narrieta 's suggestion.
azurelinuxagent/ga/exthandlers.py
Outdated
ext_status = ExtensionStatus(seq_no=seq_no) | ||
ext_status.message = u"Failed to get status file for {0}".format(ext.name) | ||
ext_status.code = -1 | ||
ext_status.status = "warning" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"no status file" or similar may be more appropriate
azurelinuxagent/ga/exthandlers.py
Outdated
def is_ext_handling_complete(self, ext): | ||
status = self.get_ext_handling_status(ext) | ||
|
||
# when there is no status, the handling is complete and return None status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to update the comment... this condition would be reached when seq < 0 (i.e. no new user settings) rather than when there is no status file
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -310,11 +311,63 @@ def handle_ext_handlers(self, etag=None): | |||
logger.info("Extension handling is on hold") | |||
return | |||
|
|||
wait_until = datetime.datetime.utcnow() + datetime.timedelta(minutes=DEFAULT_EXT_TIMEOUT_MINUTES) | |||
prev_handler = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this variable is not referenced anymore
azurelinuxagent/ga/exthandlers.py
Outdated
# We wait only for the installation. Not for the uninstallation. | ||
# If handled successfully, proceed with the current handler. | ||
# Otherwise, skip the rest of the extension installation. | ||
dep_level = ext_handler.sort_key() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is done for install and enable; let's update the comment to reflect that
azurelinuxagent/ga/exthandlers.py
Outdated
dep_level = ext_handler.sort_key() | ||
if dep_level >= 0 and dep_level < max_dep_level: | ||
if not self.wait_for_handler_successful_completion(ext_handler, wait_until): | ||
logger.warn("Skipping processing extensions") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add the extension name and rephrase the message to read something similar to "Extension {0} failed or timed out, will skip processing the rest of the extensions"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be too much logging. We already log the extension name in the wait function. This is to just log that the rest of the extensions will be skipped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can leave the extension name out if you prefer so. Let's be explicit about why we are skipping the rest of the extensions, though. Maybe something like "An extension failed or timed out, will skip processing the rest of the extensions". Hopefully this will be more helpful for CSS or a customer reading the log.
azurelinuxagent/ga/exthandlers.py
Outdated
return False | ||
|
||
if status != EXTENSION_STATUS_SUCCESS: | ||
msg = "Extension {0} did not succeed. Last status was {1}".format(ext.name, status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: "Status was" instead of "Last status was"
058e721
to
fcffb8d
Compare
tests/ga/test_extension.py
Outdated
extension = Extension(name=handler_name) | ||
exthandler.properties.extensions.append(extension) | ||
|
||
# Override the timout value to minimize the test duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "timout"
tests/ga/test_extension.py
Outdated
ExtHandlerInstance.get_ext_handling_status = MagicMock(return_value=None) | ||
self.assertTrue(exthandlers_handler.wait_for_handler_successful_completion(handler, datetime.datetime.utcnow())) | ||
|
||
def _test_wait_for_handler_successful_completion(self, exthandlers_handler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove "test" from the name of this helper function. this threw me off... gave me the impression that it was a test
# Missing status file is considered a non-terminal state here | ||
# so that extension sequencing can wait until it becomes existing | ||
if not os.path.exists(ext_status_file): | ||
status = "warning" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should change this status... this gives the impression that the extension is reporting a warning when in actuality there is no status file. I would suggest return something like "status file not found" or similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value is verified against a set of valid values. I think we should adhere to those values. Also it is clearly commented above that condition in which it can occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that if we reach this path, then the status in this message:
"Extension {0} did not reach a terminal state within the allowed timeout. Last status was {1}".
will be "warning". When looking at the log, it'll seem like the extension actually reported a warning, when in actuality it didn't report any status. That may be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to pass up the entire status, not just the status field, to give a more complete log message and add a __str__
to ExtStatus
so that it logs something sensible
tests/ga/test_extension.py
Outdated
return exthandlers_handler.wait_for_handler_successful_completion(exthandler, wait_until) | ||
|
||
def test_wait_for_handler_successful_completion_none(self, *args): | ||
test_data = WireProtocolData(DATA_FILE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a comment about what each test is verifying or, better, a more descriptive test name... for this one maybe "test_wait_for_handler_successful_completion_returns_false_when_there_is_no_status_file"
tests/ga/test_extension.py
Outdated
test_data = WireProtocolData(DATA_FILE) | ||
exthandlers_handler, protocol = self._create_mock(test_data, *args) | ||
|
||
ExtHandlerInstance.get_ext_handling_status = MagicMock(return_value=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this test verifying? that wait_for_handler_successful_completion returns false when there is no status file? If that is the case then I would instead mock get_status_file_path to return None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When get_ext_handling_status() is mocked, it is unnecessary to mock the get_status_file_path(). It will be a no-op in that case.
The case which you are referring is covered in test_get_ext_handling_status()
tests/ga/test_extension.py
Outdated
ExtHandlerInstance.get_ext_handling_status = MagicMock(return_value=status) | ||
self.assertFalse(self._test_wait_for_handler_successful_completion(exthandlers_handler)) | ||
|
||
def test_wait_for_handler_successful_completion_timeout(self, *args): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test needs to be re-written... the way it reads now it looks like it is testing that wait_for_handler_successful_completion returns false when the status is "warning" (the test name seems to imply that this should be testing when the extension times out before reporting a terminal status)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"warning" is a non-terminal state which will eventually timeout the wait function and return false. This is what is tested here. We have set the timeout value to be 5 seconds to minimize the duration.
Do you mean to rename this function? If it needs to be rewritten, can you please explain what is missing here?
tests/ga/test_extension.py
Outdated
exthandler.properties.extensions.append(extension) | ||
|
||
# list of [seq_no, ext_status_file, is_status_file_exist, expected_result] | ||
test_cases = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add comments about what each test case is verifying? e.g. the first 2 seems to be checking the result when there are no new test settings, the 3rd one seems to be verifying the case when the extension doesn't create a status file, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is already commented in line# 765.
fcffb8d
to
849e3e0
Compare
|
||
|
||
# Wait for the extension installation until it is handled. | ||
# This is done for the install and enable. Not for the uninstallation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this comment is entirely accurate.
"install and enable" refer to transition steps that handle_ext_handler performs to bring an extension from the uninstalled state to the disabled/installed state and from the disabled/installed state to the enabled state. handle_ext_handler
takes care of these steps. sort_key
is based on the goal state for the extension ("enabled" or not "enabled" which is either "disabled" or "uninstall"). If an extension is uninstalled and the goal state is "disabled", then handle_ext_handler
will install but not enable the extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The initial ask for a comment similar to this was to clarify the intention of the "dep_level >= 0" check... we wait only when the extension was installed or enabled. As it is currently phrased it seems like it is not helping so feel free to remove it or change it accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm; automation was successful
BVTs OK |
Adding support for extension sequencing in the azure linux agent
Issue #
VMSS extensions can be installed and enabled in a sequence by specifying dependsOn property values in the ARM template. CRP processes the dependency data, builds dependency graph and generates dependency level for each extension. The dependency level information is passed on to the VM agent. The VM agent needs to parse the dependency data and install the extensions based on the dependency level.
PR information
Quality of Code and Contribution Guidelines